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We show how the integrators used for the molecular dynamics step of the Hybrid Monte Carlo 
algorithm can be further improved. These integrators not only approximately conserve some Hamil- 
tonian H but conserve exactly a nearby shadow Hamiltonian H. This property allows for a new 
tuning method of the molecular dynamics integrator and also allows for a new class of integrators 
(force-gradient integrators) which is expected to reduce significantly the computational cost of future 
large-scale gauge field ensemble generation. 



INTRODUCTION AND MOTIVATION 



Hybrid Monte Carlo (HMC) [jj is the algorithm of 
choice to generate lattice QCD configurations including 
the effect of dynamical fcrmions. The most time consum- 
ing ingredient of HMC is the molecular dynamics (MD) 
step, which consists of a reversible volume-preserving ap- 
proximate MD trajectory of t /St steps (with r being the 
length of the trajectory and St the stepsize) followed by a 
Metropolis accept/reject test with acceptance probability 
min(l, e~ 5H ) where SH is the change in the Hamiltonian 
H = T + S whose kinetic and potential parts are T and 

s. 



A molecular dynamics trajectory is not only an ap- 
proximate integral curve of the Hamiltonian vector field 
H corresponding to H, but is also an exact integral 

curve of the Hamiltonian vector field H of an exactly 
conserved shadow Hamiltonian H. The asymptotic ex- 
pansion of this shadow Hamiltonian in the stepsize St 
may be computed using the Baker-Campbell-Hausdorff 
(BCH) formula and expressed in terms of Poisson brack- 
ets (PBs) @, 0). As a simple example consider the 
PQPQP (also known as 2MN 0]) integrator 



PQPQP 



(r)=( 



TSt (1-2A)S<5t ATSt XS&r 



t I 5t 



whose shadow Hamiltonian is 



H PQPQP = H+ ( 6A2 ~ 6 2 A + 1 {3, {S, T}} + Izftr, {S, T}} ) ^ 



(1) 
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Note that we have one free parameter, A, which is often 
set to some value not taking PBs into account. In [5J, the 
authors chose A by minimising SH empirically, requiring 
a sequence of runs at different values of A. Others used 
A c w 0.193183 Q, which minimizes the norm of the co- 
efficients of the PBs in the second-order term. However, 
this is not necessarily the best choice. 

We have evaluated PBs and shadow Hamiltonians for 
gauge theories (where gauge fields are constrained to live 
on a Lie group manifold) for the first time 0, . There- 



fore, in this letter we propose to measure the volume- 
averaged PBs and tune the free parameters of an MD 
integrator taking PB measurements into account. As we 
will see, our tuning procedure also allows us to find out 
the best number of steps of a nested integrator scheme. 
We also present a new integrator step and a new integra- 
tor which will be able to reduce the cost of large volume 
simulations. 
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FIG. 1. Comparison of measured acceptance rates and their 
predictions from Poisson bracket measurements. In the left 
hand plot we fix St — 0.1 and leave A as a free parameter, 
whereas in the right we take A = 0.18 and plot P acc as a 
function of St. 



INTEGRATOR TUNING 



on the wall-clock time spent computing the force terms 
on a unit of MD time as well as the acceptance rate, and 
the autocorrelation time T corr for the observables. We 
neglect T corr in this discussion as it is not sensitive to the 
choice of integrator parameters as long as the acceptance 
rate is reasonable, and define our cost metric as 



cost 



trajectory CPU time 
PZ t 



(3) 



For our purposes, the numerator of eq. ([3]) is estimated 
by considering the time spent in force computation along 
the trajectory. In particular, for a nested integrator, the 
numerator of this cost function is a function of the num- 
ber of steps at each level times the CPU time required to 
compute the forces at that level. Therefore, minimizing 
eq. (J3J will allow us to both find out the optimal integra- 
tor parameters, as well as find out the optimal stepsize, 
or the number of steps at each level of a nested integrator 
scheme. This is a more direct approach than the popular 
"balancing forces" method 



Let us define the difference between the shadow (H) 
and actual (H) Hamiltonians as AH = H — H . Noting 
that Var(AP) means the variance of the distribution of 
values of AH over phase space, one can show that the 
acceptance rate P acc can be given by [8j 



P acc = crfc ^y- Var(AP) J . (2) 

To estimate P acc from eq. [3J one only needs to measure 
the PB from equilibrated configurations. This allow us 
to express P acc as a function of the integrator parameters 
and find their optimal values that maximize P acc . 

As a simple test, we consider an HMC simulation of 
two flavors of Wilson fermions at k = 0.158 and Wil- 
son gauge action at f3 = 5.6 on an 8 4 lattice. We use 
a single level PQPQP integrator and a unit trajectory 
length, therefore we have two tunable parameters: the 
integrator parameter A and the step size St. We measure 
AH up to fourth order in St. In Figure Q] we compare 
the acceptance rates predicted by the formula above with 
numerical data taken from simulations at various values 
of A and St. The PB values used for the predictions were 
measured at A = 0.18 and St — 0.1 - but one should 
note that our predictions are independent of the integra- 
tor parameters used to get the Poisson bracket values. 

The plots show good agreement between predicted 
and measured acceptance rates, provided the stepsize is 
not too big, otherwise the BCH expansion breaks down. 
Moreover, the maximum of the acceptance rate in the 
left hand plot is achieved at \ max ~ 0.1836 (to be com- 
pared with A c ). We now use eq. ([2]) to tune the MD 
integrator on a larger volume. Ultimately, we are inter- 
ested in reducing the computational cost, which depends 



TUNING A REAL SIMULATION 



Level i 


Force 


F time 


FG time 





Hasenbusch (fj, = / fj, = 0.057) 


21.21 s 


26.61 s 


1 


Hasenbusch (fj, = 0.057 / (i = 0.25) 


3.98 s 


7.55 s 


2 


Wilson O = 0.25) 


1.05 s 


1.98 s 


3 


Gauge 


0.075 s 


0.142 s 



TABLE I. Set-up used in the HMC simulation described in 
this section, together with typical times spent on force compu- 
tation, fi is the twisted mass parameter [jj]. For convenience, 
times for the force-gradient computation are also shown here. 

As an application of our tuning technology, we consider 
a HMC simulation of a 24 3 x 32 lattice, with two flavours 
of Wilson fermions with k — 0.1580 and f3 = 5.6. As 
in Q, we use a nested PQPQP integrator scheme, with 
the inclusion of two Hasenbusch fields with twisted mass 
fermions as "preconditioners" . In |9j , each nested level of 
the integrator has one force term, and the free parameter 
of the PQPQP integrator has been set to A = 1/6 at 
all levels. In Table |H for each level i (note that is 
the outermost level), we show the type of force and its 
parameters, and mean values of the time spent on force 
and force-gradient computation. 

In order to improve the integrator scheme used in j^], 
we considered two different nested schemes: 

PQ4. the original scheme, but with tuned values of A; 

PQ3. the two Hasenbusch fields appear now at the same 
level (so we have only 3 different levels). 
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Tuning 














Prediction 


Measurement 


Scheme 


mi 








\i 






F time 




Time 






12 


3 





1 


2 


3 


p acc 


/ traj. 


Pace 


/ traj. 


Cost 


Original 


3 12 


3 


1/6 


1/6 


1/6 


1/6 


0.85(1)(3) 


308 s 


0.88(4) 


405 s 


463 


PQ4 / 4th 


3 12 


1 


0.1903(36) 


0.1696(66) 


0.1885(69) 


0.1670(80) 


0.89(1)(0) 


294 s 


0.85(3) 


399 s 


471 


PQ4 / 4th 


3 11 


2 


0.1966(64) 0.1660(131) 


0.1885(35) 


0.1524(168) 


0.80(2) (0) 


267 s 


0.82(5) 


360 s 


438 


PQ3 / 4th 


3 3 2 


- 


0.1803(22) 


0.1902(53) 


0.1281(220) 


— 


0.83(3) (0) 


234 s 


0.83(4) 


345 s 


417 


PQ3 / 2nd 


3 3 2 




0.1735(25) 


0.1924(53) 


0.1415(216) 




0.81(2)(3) 


234 s 


0.72(4) 


354 s 


491 


PQ3 


3 3 2 




A c 


A c 


A c 




0.76(3) (3) 


234 s 


0.80(3) 


339 s 


425 


PQ3 


3 3 2 




1/6 


1/6 


1/6 




0.73(3) (5) 


234 s 


0.74(5) 


349 s 


473 


PQ3 / 4th 


3 3 1 




0.1793(24) 


0.1890(65) 


0.1636(66) 




0.81(2)(0) 


228 s 


0.78(5) 


342 s 


442 



TABLE II. Tuning of the PQPQP integrator scheme. All errors shown are statistical, with the exception of the second set of 
errors in the predicted acceptance rates, which are some sort of systematic error, estimated from the difference of predicting 
acceptance rates using a shadow Hamiltonian up to St 2 or St 4 . All times refer to runs utilizing 128 cores of the Iridis cluster. 



Table |TT] shows parameters which minimize the cost met- 
ric ([3]). In our simulations, we have fixed r = 1, un- 
less stated otherwise. For each scheme (which is also 
described by the highest power of St used to compute 
AH) , we show the optimal number of steps at each level, 
the optimal A parameters, our predictions for the accep- 
tance rate, the estimated time spent in force computation 
in one trajectory, and measurements of acceptance rates 
and trajectory times. For comparison we also show data 
for the original scheme Q. 

We see that all predicted acceptance rates agree, within 
errors, with the measured ones. Furthermore, the tuning 
of A's in the PQ4 scheme allows a reduction of the number 
of steps on the inner levels, so the CPU time per trajec- 
tory decreases. Moreover, the PQ3 scheme allows further 
improvement in cost measures. For this scheme, we also 
show the performance obtained using other A values. We 
conclude that, for an optimal choice of integrator param- 
eters, one is encouraged to tune the integrator using the 
best available approximation to AH . 



FORCE-GRADIENT INTEGRATOR 

Since the Poisson bracket {S, {S, T}} does not depend 
on momentum we can evaluate the integrator step 
e {S,{S,T}}Sr a ex pij c itly. if one use s A = 1/6 for the 
PQPQP integrator together with this integrator step, we 
eliminate all 0(St 2 ) terms in AH . We therefore define 
a PQPQP force- gradient integrator as 



U fg (t) 



e hS6T e ±T5T e 



- e lTST e ±SST 



r /St 



Note that the performance of this integrator has been 
shown to be much better than Campostrini integrator 
Q. This is not surprising since the coefficients of the 
0{St a ) term in the shadow Hamiltonian are about two 
orders of magnitude smaller [3|. 



In order to test the performance of this integrator, we 
considered the two integrator schemes defined in the last 
section, replacing the PQPQP integrator, at all levels, by 
this new one. We will denote these new schemes as FG4 
and FG3 which have 4 and 3 different levels, respectively. 
As in this case there are no tuneable parameters, we could 
only vary the number of steps at each level. In Table IIIII 
we show the best parameters we found as well as the 
measured values of acceptance rates and trajectory times. 













Prediction 


Measurement 


Scheme 


T 




rrn 






F + FG 




Time 











1 2 


3 


Pace 


time / traj. 


Pace 


/ traj. 


cost 


FG4 


1.0 


3 


1 1 


1 


0.97(1) 


415 s 


0.92(2) 


523 s 


569 


FG4 


1.1 


3 


1 1 


1 


0.96(1) 


415 s 


0.85(4) 


526 s 


560 


FG4 


1.2 


3 


1 1 


1 


0.94(1) 


415 s 


0.71(6) 


500 s 


585 


FG3 


1.0 


3 


1 1 




0.91(1) 


314 s 


0.80(3) 


393 s 


492 



TABLE III. Tuning of FG4 and FG3 integrator schemes. Note 
that Pace predictions are not compatible with measurements. 
We expect this to be due to higher order corrections to AH. 

A comparison between Tables [TTI and IIIII shows that the 
use of a force-gradient integrator allows a smaller number 
of steps in the inner levels. Furthermore, comparing the 4 
nested level schemes, one sees that the force-gradient ac- 
ceptance rates are higher. However, the trajectory CPU 
times also increase (because we have to evaluate one more 
step), so the cost measures are higher. 



A LARGER VOLUME 

Despite the comparison done in the last section does 
not favour the use of a force-gradient integrator, it is 
expected this integrator will be of use for larger lattice 
volumes. Thus we now consider a thermalized 40 4 simu- 
lation (using the same action parameters as used in the 
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Tuning 
Scheme 


rm 
12 3 


Ai 

12 3 


Predic 

P.cc 


Jon 

F time 
/ traj. 


Mea 

Pace 


sureme 
Time 
/ traj. 


tit 

Cost 


PQ4 / 4th 
FG4 


5 12 1 
4 111 


0.1940(32) 0.1712(53) 0.1961(50) 0.1757(64) 


0.90(1)(0) 
0.95(1) 


1819 s 
2196 s 


0.85(4) 
0.84(4) 


2143 s 
2462 s 


2513 
2931 


PQ3 / 4th 
FG3 


5 3 1- 
411- 


0.1780(18) 0.1995(49) 0.1794(55) 


0.83(2)(1) 
0.92(1) 


1513 s 
1816 s 


0.82(4) 
0.82(5) 


1934 s 
2158 s 


2355 
2641 



TABLE IV. Tuning of a 40 4 simulation. CPU times refer to runs utilizing 256 cores of the Iridis cluster. 



previous sections) and proceeded with a similar tuning 
analysis. We show the results in Table IIVI Although we 
were able to use a higher stepsize for the schemes which 
use a force-gradient integrator, the trajectory CPU times 
are still higher than PQPQP runs. 

Using the data available, one can compute ratios be- 
tween the cost of the FG schemes over PQPQP ones; we 
see they decrease for larger volumes — see Table IVl From 
these results, we can see that at some increased volume 
(assuming constant physics) the force-gradient integrator 
will become more efficient than the PQPQP integrator. 



Nested scheme 


FG4/PQ4 


FG3/PQ3 


24 3 x 32 


1.30 


1.18 


40 4 


1.17 


1.12 



TABLE V. Cost ratios (FG over PQPQP). 

This cross-over point can be estimated from appeal- 
ing to the requirement that the equilibrium distribution 
must satisfy {e SH ) = 1. Expanding to second order, we 
have (SH) ~ ^(SH 2 ), thus for second- and fourth-order 
integrators (SH) ~ St 4 and (SH) ~ <5r 8 , respectively. 
The cost has a trivial linear volume factor and scales lin- 
early with 1/St. Since SH is extensive in the volume we 
can equate it with the volume stepsize scaling, thus we 
have cost PQPQP - V 5 / 4 and cost FG ~ V 9 / 8 , so the cost 
ratio will behave as U -1 / 8 . Using the data in table fVl to 
estimate the multiplicative coefficient, we estimate that 
the force-gradient integrator becomes more efficient at 
V = 56 4 and V = 50 4 for the four-level and three-level 
integrators, respectively. Given current leading-edge lat- 
tice computations are presently at this volume, or slightly 
larger, we therefore predict that the force-gradient inte- 
grator will reduce significantly the computational cost for 
future large-scale gauge field ensemble generation. 



CONCLUSIONS 

We have presented a novel way of tuning a HMC in- 
tegrator, together with practical examples. This tuning 
procedure can be used for all lattice gauge and fermionic 



actions, and allows a systematic study of the MD in- 
tegrators currently used in large scale dynamical lattice 
simulations. We have also presented a new fourth-order 
integrator which is expected to reduce significantly the 
computational cost of HMC simulations. 

We are currently working towards a general implemen- 
tation of the calculation of Poisson brackets and force- 
gradient terms in Chroma f5~oj ] . Details about computing 



Poisson Brackets will be presented elsewhere [Tl( . In the 
near future we will also consider tuning simulations using 
other lattice actions. 
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